Case study 2: Linear regression model to predict sales price of properties
Contents

Case study 2: Linear regression model to predict sales price of properties¶
Business Objective¶
This dataset contains properties sold in New York City over a 12-month period from September 2016 to September 2017. The objective is to build a model to predict sale value in the future.
Load Dataset¶
#Download dataset
#!wget -q https://www.dropbox.com/s/6tc7e6rc395c7jz/nyc-property-sales.zip
#Unzip the data
#!unzip nyc-property-sales.zip > /dev/null; echo " done."
#!ls
#Install Packages
#!pip -q install plotly-express
#!pip -q install shap
#!pip -q install eli5
#!pip -q install lime
Import Packages¶
#Import basic packages
import warnings
warnings.filterwarnings("ignore")
import time
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split #splitting data
from pylab import rcParams
from sklearn.linear_model import LinearRegression #linear regression
from sklearn.metrics.regression import mean_squared_error #error metrics
from sklearn.metrics import mean_absolute_error
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline
sns.set(color_codes=True)
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_21860\4029770141.py in <module>
4 warnings.filterwarnings("ignore")
5 import time
----> 6 import pandas as pd
7 import numpy as np
8 import pickle
ModuleNotFoundError: No module named 'pandas'
# Code for displaying plotly express plots inline in colab
def configure_plotly_browser_state():
import IPython
display(IPython.core.display.HTML('''
<script src="/static/components/requirejs/require.js"></script>
<script>
requirejs.config({
paths: {
base: '/static/base',
plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
},
});
</script>
'''))
import plotly_express as px
Read Data¶
1. Through Pandas¶
# Read data through Pandas and compute time taken to read
t_start = time.time()
df_prop = pd.read_csv('nyc-rolling-sales.csv')
t_end = time.time()
print('pd.read_csv(): {} s'.format(t_end-t_start)) # time [s]
df_prop.head()
pd.read_csv(): 1.7584502696990967 s
| Unnamed: 0 | BOROUGH | NEIGHBORHOOD | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | EASE-MENT | BUILDING CLASS AT PRESENT | ADDRESS | ... | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | TAX CLASS AT TIME OF SALE | BUILDING CLASS AT TIME OF SALE | SALE PRICE | SALE DATE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 392 | 6 | C2 | 153 AVENUE B | ... | 5 | 0 | 5 | 1633 | 6440 | 1900 | 2 | C2 | 6625000 | 2017-07-19 00:00:00 | |
| 1 | 5 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 26 | C7 | 234 EAST 4TH STREET | ... | 28 | 3 | 31 | 4616 | 18690 | 1900 | 2 | C7 | - | 2016-12-14 00:00:00 | |
| 2 | 6 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 39 | C7 | 197 EAST 3RD STREET | ... | 16 | 1 | 17 | 2212 | 7803 | 1900 | 2 | C7 | - | 2016-12-09 00:00:00 | |
| 3 | 7 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2B | 402 | 21 | C4 | 154 EAST 7TH STREET | ... | 10 | 0 | 10 | 2272 | 6794 | 1913 | 2 | C4 | 3936272 | 2016-09-23 00:00:00 | |
| 4 | 8 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 404 | 55 | C2 | 301 EAST 10TH STREET | ... | 6 | 0 | 6 | 2369 | 4615 | 1900 | 2 | C2 | 8000000 | 2016-11-17 00:00:00 |
5 rows × 22 columns
2. Through Dask¶
# Read data through Dask and compute time taken to read
import dask.dataframe as dd
t_start = time.time()
df_dask = dd.read_csv('nyc-rolling-sales.csv')
t_end = time.time()
print('dd.read_csv(): {} s'.format(t_end-t_start)) # time [s]
df_dask.tail()
dd.read_csv(): 0.12152266502380371 s
| Unnamed: 0 | BOROUGH | NEIGHBORHOOD | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | EASE-MENT | BUILDING CLASS AT PRESENT | ADDRESS | ... | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | TAX CLASS AT TIME OF SALE | BUILDING CLASS AT TIME OF SALE | SALE PRICE | SALE DATE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84543 | 8409 | 5 | WOODROW | 02 TWO FAMILY DWELLINGS | 1 | 7349 | 34 | B9 | 37 QUAIL LANE | ... | 2 | 0 | 2 | 2400 | 2575 | 1998 | 1 | B9 | 450000 | 2016-11-28 00:00:00 | |
| 84544 | 8410 | 5 | WOODROW | 02 TWO FAMILY DWELLINGS | 1 | 7349 | 78 | B9 | 32 PHEASANT LANE | ... | 2 | 0 | 2 | 2498 | 2377 | 1998 | 1 | B9 | 550000 | 2017-04-21 00:00:00 | |
| 84545 | 8411 | 5 | WOODROW | 02 TWO FAMILY DWELLINGS | 1 | 7351 | 60 | B2 | 49 PITNEY AVENUE | ... | 2 | 0 | 2 | 4000 | 1496 | 1925 | 1 | B2 | 460000 | 2017-07-05 00:00:00 | |
| 84546 | 8412 | 5 | WOODROW | 22 STORE BUILDINGS | 4 | 7100 | 28 | K6 | 2730 ARTHUR KILL ROAD | ... | 0 | 7 | 7 | 208033 | 64117 | 2001 | 4 | K6 | 11693337 | 2016-12-21 00:00:00 | |
| 84547 | 8413 | 5 | WOODROW | 35 INDOOR PUBLIC AND CULTURAL FACILITIES | 4 | 7105 | 679 | P9 | 155 CLAY PIT ROAD | ... | 0 | 1 | 1 | 10796 | 2400 | 2006 | 4 | P9 | 69300 | 2016-10-27 00:00:00 |
5 rows × 22 columns
Observation: Dask is almost 10 times faster than Pandas when reading files.
3. Subsample into n rows¶
df_prop.shape
(84548, 22)
# Sample n rows
n = df_prop.shape[0]
df_prop = df_prop.sample(n)
df_prop.shape
(84548, 22)
Exploratory Data Analysis¶
Let’s look into summary statistics¶
#Let's look into the total number of columns and observations in the dataset
df_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 84548 entries, 26286 to 50843
Data columns (total 22 columns):
Unnamed: 0 84548 non-null int64
BOROUGH 84548 non-null int64
NEIGHBORHOOD 84548 non-null object
BUILDING CLASS CATEGORY 84548 non-null object
TAX CLASS AT PRESENT 84548 non-null object
BLOCK 84548 non-null int64
LOT 84548 non-null int64
EASE-MENT 84548 non-null object
BUILDING CLASS AT PRESENT 84548 non-null object
ADDRESS 84548 non-null object
APARTMENT NUMBER 84548 non-null object
ZIP CODE 84548 non-null int64
RESIDENTIAL UNITS 84548 non-null int64
COMMERCIAL UNITS 84548 non-null int64
TOTAL UNITS 84548 non-null int64
LAND SQUARE FEET 84548 non-null object
GROSS SQUARE FEET 84548 non-null object
YEAR BUILT 84548 non-null int64
TAX CLASS AT TIME OF SALE 84548 non-null int64
BUILDING CLASS AT TIME OF SALE 84548 non-null object
SALE PRICE 84548 non-null object
SALE DATE 84548 non-null object
dtypes: int64(10), object(12)
memory usage: 14.8+ MB
df_prop.isnull().sum()
Unnamed: 0 0
BOROUGH 0
NEIGHBORHOOD 0
BUILDING CLASS CATEGORY 0
TAX CLASS AT PRESENT 0
BLOCK 0
LOT 0
EASE-MENT 0
BUILDING CLASS AT PRESENT 0
ADDRESS 0
APARTMENT NUMBER 0
ZIP CODE 0
RESIDENTIAL UNITS 0
COMMERCIAL UNITS 0
TOTAL UNITS 0
LAND SQUARE FEET 0
GROSS SQUARE FEET 0
YEAR BUILT 0
TAX CLASS AT TIME OF SALE 0
BUILDING CLASS AT TIME OF SALE 0
SALE PRICE 0
SALE DATE 0
dtype: int64
#Let's look into summary statistics of data
df_prop.describe()
| Unnamed: 0 | BOROUGH | BLOCK | LOT | ZIP CODE | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | YEAR BUILT | TAX CLASS AT TIME OF SALE | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 84548.000000 | 84548.000000 | 84548.000000 | 84548.000000 | 84548.000000 | 84548.000000 | 84548.000000 | 84548.000000 | 84548.000000 | 84548.000000 |
| mean | 10344.359878 | 2.998758 | 4237.218976 | 376.224015 | 10731.991614 | 2.025264 | 0.193559 | 2.249184 | 1789.322976 | 1.657485 |
| std | 7151.779436 | 1.289790 | 3568.263407 | 658.136814 | 1290.879147 | 16.721037 | 8.713183 | 18.972584 | 537.344993 | 0.819341 |
| min | 4.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 4231.000000 | 2.000000 | 1322.750000 | 22.000000 | 10305.000000 | 0.000000 | 0.000000 | 1.000000 | 1920.000000 | 1.000000 |
| 50% | 8942.000000 | 3.000000 | 3311.000000 | 50.000000 | 11209.000000 | 1.000000 | 0.000000 | 1.000000 | 1940.000000 | 2.000000 |
| 75% | 15987.250000 | 4.000000 | 6281.000000 | 1001.000000 | 11357.000000 | 2.000000 | 0.000000 | 2.000000 | 1965.000000 | 2.000000 |
| max | 26739.000000 | 5.000000 | 16322.000000 | 9106.000000 | 11694.000000 | 1844.000000 | 2261.000000 | 2261.000000 | 2017.000000 | 4.000000 |
Observation:
There is a column called Unnamed: 0 which is not required as it contains only continuous index numbers
The datatypes of saleprice is not correct because the summary statistics of sale price is not displayed
Hence there is a lot of data cleaning to perform.
Data Cleaning¶
Pandas profiling¶
'''#Perform Pandas profiling to understand quick overview of columns
import pandas_profiling
report = pandas_profiling.ProfileReport(df_prop)
#covert profile report as html file
report.to_file("property_data.html")'''
Removal of unnecessary columns¶
Observation:
From Pandas profiling we understand EASEMENT column has no significant value and thus has to be removed.
# Let's explore why EASE-MENT has to be rejected
df_prop['EASE-MENT'].unique()
array([' '], dtype=object)
#This column has no significance other than being an iterator
del df_prop['Unnamed: 0']
#This column has no significant value
del df_prop['EASE-MENT']
Observation:
From Pandas profiling we understand SALE PRICEcolumns have string value in some rows and thus has to be removed.
From Pandas Profiling we understand LAND SQUARE FEET and GROSS SQUARE FEET columns have string values which have to replaced by appropriate values
df_prop['SALE PRICE'] = df_prop['SALE PRICE'].replace(' - ',np.nan)
df_prop.dropna(inplace=True)
df_prop.shape
(69987, 20)
df_prop['LAND SQUARE FEET'] = df_prop['LAND SQUARE FEET'].replace(' - ',np.nan)
df_prop['GROSS SQUARE FEET'] = df_prop['GROSS SQUARE FEET'].replace(' - ',np.nan)
# count the number of NaN values in each column
print(df_prop.isnull().sum())
BOROUGH 0
NEIGHBORHOOD 0
BUILDING CLASS CATEGORY 0
TAX CLASS AT PRESENT 0
BLOCK 0
LOT 0
BUILDING CLASS AT PRESENT 0
ADDRESS 0
APARTMENT NUMBER 0
ZIP CODE 0
RESIDENTIAL UNITS 0
COMMERCIAL UNITS 0
TOTAL UNITS 0
LAND SQUARE FEET 21188
GROSS SQUARE FEET 21739
YEAR BUILT 0
TAX CLASS AT TIME OF SALE 0
BUILDING CLASS AT TIME OF SALE 0
SALE PRICE 0
SALE DATE 0
dtype: int64
df_prop['LAND SQUARE FEET'] = df_prop['LAND SQUARE FEET'].replace('0',np.nan)
df_prop['GROSS SQUARE FEET'] = df_prop['GROSS SQUARE FEET'].replace('0',np.nan)
print(df_prop.isnull().sum())
BOROUGH 0
NEIGHBORHOOD 0
BUILDING CLASS CATEGORY 0
TAX CLASS AT PRESENT 0
BLOCK 0
LOT 0
BUILDING CLASS AT PRESENT 0
ADDRESS 0
APARTMENT NUMBER 0
ZIP CODE 0
RESIDENTIAL UNITS 0
COMMERCIAL UNITS 0
TOTAL UNITS 0
LAND SQUARE FEET 31514
GROSS SQUARE FEET 33156
YEAR BUILT 0
TAX CLASS AT TIME OF SALE 0
BUILDING CLASS AT TIME OF SALE 0
SALE PRICE 0
SALE DATE 0
dtype: int64
df_prop.describe()
| BOROUGH | BLOCK | LOT | ZIP CODE | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | YEAR BUILT | TAX CLASS AT TIME OF SALE | |
|---|---|---|---|---|---|---|---|---|---|
| count | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 |
| mean | 2.921928 | 4196.072528 | 373.828397 | 10741.455185 | 1.899553 | 0.172489 | 2.092203 | 1799.348236 | 1.641976 |
| std | 1.235688 | 3429.196524 | 656.096528 | 1263.234938 | 14.549545 | 9.123717 | 17.276100 | 520.884552 | 0.771162 |
| min | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 2.000000 | 1348.000000 | 22.000000 | 10306.000000 | 0.000000 | 0.000000 | 0.000000 | 1920.000000 | 1.000000 |
| 50% | 3.000000 | 3378.000000 | 50.000000 | 11209.000000 | 1.000000 | 0.000000 | 1.000000 | 1937.000000 | 2.000000 |
| 75% | 4.000000 | 6186.000000 | 709.000000 | 11249.000000 | 2.000000 | 0.000000 | 2.000000 | 1965.000000 | 2.000000 |
| max | 5.000000 | 16319.000000 | 9106.000000 | 11694.000000 | 1844.000000 | 2261.000000 | 2261.000000 | 2017.000000 | 4.000000 |
## Define a function impute_median and fill land square feet and gross square feet with median values
def impute_median(series):
return series.fillna(series.median())
df_prop['LAND SQUARE FEET'] = df_prop['LAND SQUARE FEET'].transform(impute_median)
df_prop['GROSS SQUARE FEET'] = df_prop['GROSS SQUARE FEET'].transform(impute_median)
df_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 69987 entries, 26286 to 73846
Data columns (total 20 columns):
BOROUGH 69987 non-null int64
NEIGHBORHOOD 69987 non-null object
BUILDING CLASS CATEGORY 69987 non-null object
TAX CLASS AT PRESENT 69987 non-null object
BLOCK 69987 non-null int64
LOT 69987 non-null int64
BUILDING CLASS AT PRESENT 69987 non-null object
ADDRESS 69987 non-null object
APARTMENT NUMBER 69987 non-null object
ZIP CODE 69987 non-null int64
RESIDENTIAL UNITS 69987 non-null int64
COMMERCIAL UNITS 69987 non-null int64
TOTAL UNITS 69987 non-null int64
LAND SQUARE FEET 69987 non-null object
GROSS SQUARE FEET 69987 non-null object
YEAR BUILT 69987 non-null int64
TAX CLASS AT TIME OF SALE 69987 non-null int64
BUILDING CLASS AT TIME OF SALE 69987 non-null object
SALE PRICE 69987 non-null object
SALE DATE 69987 non-null object
dtypes: int64(9), object(11)
memory usage: 11.2+ MB
df_prop.describe()
| BOROUGH | BLOCK | LOT | ZIP CODE | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | YEAR BUILT | TAX CLASS AT TIME OF SALE | |
|---|---|---|---|---|---|---|---|---|---|
| count | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 | 69987.000000 |
| mean | 2.921928 | 4196.072528 | 373.828397 | 10741.455185 | 1.899553 | 0.172489 | 2.092203 | 1799.348236 | 1.641976 |
| std | 1.235688 | 3429.196524 | 656.096528 | 1263.234938 | 14.549545 | 9.123717 | 17.276100 | 520.884552 | 0.771162 |
| min | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 2.000000 | 1348.000000 | 22.000000 | 10306.000000 | 0.000000 | 0.000000 | 0.000000 | 1920.000000 | 1.000000 |
| 50% | 3.000000 | 3378.000000 | 50.000000 | 11209.000000 | 1.000000 | 0.000000 | 1.000000 | 1937.000000 | 2.000000 |
| 75% | 4.000000 | 6186.000000 | 709.000000 | 11249.000000 | 2.000000 | 0.000000 | 2.000000 | 1965.000000 | 2.000000 |
| max | 5.000000 | 16319.000000 | 9106.000000 | 11694.000000 | 1844.000000 | 2261.000000 | 2261.000000 | 2017.000000 | 4.000000 |
#Convert few column datatypes into appropriate ones for conserving memory
df_prop['TAX CLASS AT TIME OF SALE'] = df_prop['TAX CLASS AT TIME OF SALE'].astype('category')
df_prop['TAX CLASS AT PRESENT'] = df_prop['TAX CLASS AT PRESENT'].astype('category')
df_prop['LAND SQUARE FEET'] = pd.to_numeric(df_prop['LAND SQUARE FEET'], errors='coerce')
df_prop['GROSS SQUARE FEET']= pd.to_numeric(df_prop['GROSS SQUARE FEET'], errors='coerce')
df_prop['SALE PRICE'] = pd.to_numeric(df_prop['SALE PRICE'], errors='coerce')
df_prop['BOROUGH'] = df_prop['BOROUGH'].astype('category')
#The datatypes have now been changed
df_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 69987 entries, 26286 to 73846
Data columns (total 20 columns):
BOROUGH 69987 non-null category
NEIGHBORHOOD 69987 non-null object
BUILDING CLASS CATEGORY 69987 non-null object
TAX CLASS AT PRESENT 69987 non-null category
BLOCK 69987 non-null int64
LOT 69987 non-null int64
BUILDING CLASS AT PRESENT 69987 non-null object
ADDRESS 69987 non-null object
APARTMENT NUMBER 69987 non-null object
ZIP CODE 69987 non-null int64
RESIDENTIAL UNITS 69987 non-null int64
COMMERCIAL UNITS 69987 non-null int64
TOTAL UNITS 69987 non-null int64
LAND SQUARE FEET 69987 non-null float64
GROSS SQUARE FEET 69987 non-null float64
YEAR BUILT 69987 non-null int64
TAX CLASS AT TIME OF SALE 69987 non-null category
BUILDING CLASS AT TIME OF SALE 69987 non-null object
SALE PRICE 69987 non-null int64
SALE DATE 69987 non-null object
dtypes: category(3), float64(2), int64(8), object(7)
memory usage: 9.8+ MB
df_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 69987 entries, 26286 to 73846
Data columns (total 20 columns):
BOROUGH 69987 non-null category
NEIGHBORHOOD 69987 non-null object
BUILDING CLASS CATEGORY 69987 non-null object
TAX CLASS AT PRESENT 69987 non-null category
BLOCK 69987 non-null int64
LOT 69987 non-null int64
BUILDING CLASS AT PRESENT 69987 non-null object
ADDRESS 69987 non-null object
APARTMENT NUMBER 69987 non-null object
ZIP CODE 69987 non-null int64
RESIDENTIAL UNITS 69987 non-null int64
COMMERCIAL UNITS 69987 non-null int64
TOTAL UNITS 69987 non-null int64
LAND SQUARE FEET 69987 non-null float64
GROSS SQUARE FEET 69987 non-null float64
YEAR BUILT 69987 non-null int64
TAX CLASS AT TIME OF SALE 69987 non-null category
BUILDING CLASS AT TIME OF SALE 69987 non-null object
SALE PRICE 69987 non-null int64
SALE DATE 69987 non-null object
dtypes: category(3), float64(2), int64(8), object(7)
memory usage: 9.8+ MB
# Let's remove sale price with a nonsensically small dollar amount: $0 most commonly.
# Since these sales are actually transfers of deeds between parties: for example, parents transferring ownership to their home to a child after moving out for retirement.
df_prop = df_prop[df_prop['SALE PRICE']!=0]
#Let's also remove observations that have gross square feet less than 400 sq. ft
#Let's also remove observations that have gross square feet less than 400 sq. ft
#Let's also remove observations that have sale price than 1000 dollars
df_prop = df_prop[df_prop['GROSS SQUARE FEET']>400]
df_prop = df_prop[df_prop['LAND SQUARE FEET']>400]
df_prop = df_prop[df_prop['SALE PRICE']>1000]
df_prop.describe()
| BLOCK | LOT | ZIP CODE | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | SALE PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 58544.000000 | 58544.000000 | 58544.000000 | 58544.000000 | 58544.000000 | 58544.000000 | 5.854400e+04 | 5.854400e+04 | 58544.000000 | 5.854400e+04 |
| mean | 4137.998565 | 391.834227 | 10707.269695 | 1.706716 | 0.163877 | 1.882259 | 3.622433e+03 | 3.259362e+03 | 1812.223917 | 1.524977e+06 |
| std | 3566.123396 | 670.343056 | 1251.444132 | 14.179364 | 9.922390 | 17.412926 | 3.541128e+04 | 2.380047e+04 | 502.376457 | 1.245480e+07 |
| min | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.050000e+02 | 4.060000e+02 | 0.000000 | 1.110000e+03 |
| 25% | 1271.750000 | 21.000000 | 10280.000000 | 0.000000 | 0.000000 | 0.000000 | 2.500000e+03 | 1.914750e+03 | 1920.000000 | 3.850000e+05 |
| 50% | 3145.500000 | 50.000000 | 11207.000000 | 1.000000 | 0.000000 | 1.000000 | 2.500000e+03 | 2.000000e+03 | 1940.000000 | 6.400000e+05 |
| 75% | 6099.000000 | 1002.000000 | 11356.000000 | 1.000000 | 0.000000 | 2.000000 | 2.500000e+03 | 2.000000e+03 | 1966.000000 | 1.098618e+06 |
| max | 16319.000000 | 9106.000000 | 11694.000000 | 1844.000000 | 2261.000000 | 2261.000000 | 4.252327e+06 | 3.750565e+06 | 2017.000000 | 2.210000e+09 |
df_prop[df_prop['SALE PRICE']==2210000000]
| BOROUGH | NEIGHBORHOOD | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | BUILDING CLASS AT PRESENT | ADDRESS | APARTMENT NUMBER | ZIP CODE | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | TAX CLASS AT TIME OF SALE | BUILDING CLASS AT TIME OF SALE | SALE PRICE | SALE DATE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7447 | 1 | MIDTOWN CBD | 21 OFFICE BUILDINGS | 4 | 1301 | 1 | O4 | 245 PARK AVENUE | 10167 | 0 | 35 | 35 | 81336.0 | 1586886.0 | 1966 | 4 | O4 | 2210000000 | 2017-05-05 00:00:00 |
Observation: The most expensive property in NYC is a whopping 2 billion dollars which can be considered as an outlier.
Let’s remove outiers!¶
q = df_prop["SALE PRICE"].quantile(0.99)
q
14000000.0
df_prop = df_prop[df_prop["SALE PRICE"] < q]
df_prop_lin = df_prop.copy()
# Convert sale date into time,month,year and day
df_prop['SALE DATE']=pd.to_datetime(df_prop['SALE DATE'])
df_prop['year']=df_prop['SALE DATE'].dt.year
df_prop['month']=df_prop['SALE DATE'].dt.month
df_prop['day']=df_prop['SALE DATE'].dt.day
df_prop['time']=df_prop['SALE DATE'].dt.hour
df_prop['day_week']=df_prop['SALE DATE'].dt.weekday_name
df_prop.head()
| BOROUGH | NEIGHBORHOOD | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | BUILDING CLASS AT PRESENT | ADDRESS | APARTMENT NUMBER | ZIP CODE | ... | YEAR BUILT | TAX CLASS AT TIME OF SALE | BUILDING CLASS AT TIME OF SALE | SALE PRICE | SALE DATE | year | month | day | time | day_week | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26286 | 3 | BAY RIDGE | 10 COOPS - ELEVATOR APARTMENTS | 2 | 6089 | 44 | D4 | 9040 FORT HAMILTON PARKWA, 76 | 11209 | ... | 1955 | 2 | D4 | 219000 | 2017-03-09 | 2017 | 3 | 9 | 0 | Thursday | |
| 55023 | 4 | ELMHURST | 07 RENTALS - WALKUP APARTMENTS | 2 | 1487 | 61 | C1 | 40-03 76TH STREET | 11373 | ... | 1927 | 2 | C1 | 3250000 | 2016-11-02 | 2016 | 11 | 2 | 0 | Wednesday | |
| 56763 | 4 | FLUSHING-NORTH | 01 ONE FAMILY DWELLINGS | 1 | 7357 | 34 | A5 | 48-05 WEEKS LANE | 11365 | ... | 1950 | 1 | A5 | 715000 | 2016-09-09 | 2016 | 9 | 9 | 0 | Friday | |
| 47664 | 3 | WILLIAMSBURG-CENTRAL | 13 CONDOS - ELEVATOR APARTMENTS | 2 | 2245 | 1319 | R4 | 80 LORIMER STREET | 3B | 11206 | ... | 2013 | 2 | R4 | 578920 | 2016-09-29 | 2016 | 9 | 29 | 0 | Thursday |
| 47236 | 3 | SUNSET PARK | 07 RENTALS - WALKUP APARTMENTS | 2A | 685 | 4 | C2 | 893 4 AVENUE | 11232 | ... | 1925 | 2 | C2 | 1250000 | 2017-08-09 | 2017 | 8 | 9 | 0 | Wednesday |
5 rows × 25 columns
df_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 25 columns):
BOROUGH 57948 non-null category
NEIGHBORHOOD 57948 non-null object
BUILDING CLASS CATEGORY 57948 non-null object
TAX CLASS AT PRESENT 57948 non-null category
BLOCK 57948 non-null int64
LOT 57948 non-null int64
BUILDING CLASS AT PRESENT 57948 non-null object
ADDRESS 57948 non-null object
APARTMENT NUMBER 57948 non-null object
ZIP CODE 57948 non-null int64
RESIDENTIAL UNITS 57948 non-null int64
COMMERCIAL UNITS 57948 non-null int64
TOTAL UNITS 57948 non-null int64
LAND SQUARE FEET 57948 non-null float64
GROSS SQUARE FEET 57948 non-null float64
YEAR BUILT 57948 non-null int64
TAX CLASS AT TIME OF SALE 57948 non-null category
BUILDING CLASS AT TIME OF SALE 57948 non-null object
SALE PRICE 57948 non-null int64
SALE DATE 57948 non-null datetime64[ns]
year 57948 non-null int64
month 57948 non-null int64
day 57948 non-null int64
time 57948 non-null int64
day_week 57948 non-null object
dtypes: category(3), datetime64[ns](1), float64(2), int64(12), object(7)
memory usage: 10.3+ MB
Data Visualization¶
#Assign numbered bouroughs to bourough names
dic = {1: 'Manhattan', 2: 'Bronx', 3: 'Brooklyn', 4: 'Queens', 5:'Staten Island'}
df_prop["borough_name"] = df_prop["BOROUGH"].apply(lambda x: dic[x])
Count of properties in NYC in each bororugh¶
%matplotlib inline
df_prop.borough_name.value_counts().nlargest().plot(kind='bar', figsize=(10,5))
plt.title("Number of properties by city")
plt.ylabel('Number of properties')
plt.xlabel('City');
plt.show()
Distribution of Sale Price¶
df_prop['SALE PRICE'].describe()
count 5.794800e+04
mean 1.056701e+06
std 1.478971e+06
min 1.110000e+03
25% 3.820000e+05
50% 6.311065e+05
75% 1.054651e+06
max 1.399500e+07
Name: SALE PRICE, dtype: float64
Observation: The maximum sale price is 14 million
df_prop['SALE PRICE'].plot.hist(bins=20, figsize=(12, 6), edgecolor = 'white')
plt.xlabel('price', fontsize=12)
plt.title('Price Distribution', fontsize=12)
plt.show()
Observation: The distribution is highly skewed towards the right which implies there are lesser properties that have a very high prices.
sns.boxplot(df_prop["SALE PRICE"])
<matplotlib.axes._subplots.AxesSubplot at 0x13bca947c88>
df_prop["log_price"] = np.log(df_prop["SALE PRICE"] + 1)
sns.boxplot(df_prop.log_price)
<matplotlib.axes._subplots.AxesSubplot at 0x13bc8eaa588>
Correlation between selected variables¶
The heat map produces a correlation plot between variables of the dataframe.
plt.figure(figsize=(15,10))
c = df_prop[df_prop.columns.values[0:19]].corr()
sns.heatmap(c,cmap="BrBG",annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x13bcc65fb70>
Observation: The heat map illustrates that sale price is independent of all column values that could be considered for linear regression.
Explore how gross square feet affects sale price¶
configure_plotly_browser_state()
px.scatter(df_prop, x="GROSS SQUARE FEET", y="SALE PRICE", size ="TOTAL UNITS" ,color="borough_name",
hover_data=["BUILDING CLASS CATEGORY","LOT"], log_x=True, size_max=60)
Observation:
Properties with more total units do not fetch larger sales price
Properties in Staten Island have comparitively lesser sales price in comparison with other boroughs in New york city
Explore how tax class at the time of sale affect sales price¶
configure_plotly_browser_state()
px.box(df_prop, x="borough_name", y="SALE PRICE", color="TAX CLASS AT TIME OF SALE",hover_data=['NEIGHBORHOOD', 'BUILDING CLASS CATEGORY'],notched=True)
Observation:
Manhatten has the highest priced properties that have a tax class 1 representing residential property of up to three units (such as one-,two-, and three-family homes and small stores or offices with one or two attached apartments) as compared to other boroughs.
Properties in Staten Island have comparitively lesser sales price in comparison with other boroughs in New york city
configure_plotly_browser_state()
px.box(df_prop, x="day_week", y="SALE PRICE", color="TAX CLASS AT TIME OF SALE", notched=True)
Observation:
On Saturdays there are no sales for the tax class 4 which represents properties such as such as offices, factories, warehouses, garage buildings, etc.
Model Building¶
Prepare the Data for model building¶
Delete columns not necessary for prediction¶
df_prop_lin.columns
Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY',
'TAX CLASS AT PRESENT', 'BLOCK', 'LOT', 'BUILDING CLASS AT PRESENT',
'ADDRESS', 'APARTMENT NUMBER', 'ZIP CODE', 'RESIDENTIAL UNITS',
'COMMERCIAL UNITS', 'TOTAL UNITS', 'LAND SQUARE FEET',
'GROSS SQUARE FEET', 'YEAR BUILT', 'TAX CLASS AT TIME OF SALE',
'BUILDING CLASS AT TIME OF SALE', 'SALE PRICE', 'SALE DATE'],
dtype='object')
#Dropping few columns
del df_prop_lin['BUILDING CLASS AT PRESENT']
del df_prop_lin['BUILDING CLASS AT TIME OF SALE']
del df_prop_lin['NEIGHBORHOOD']
del df_prop_lin['ADDRESS']
del df_prop_lin['SALE DATE']
del df_prop_lin['APARTMENT NUMBER']
del df_prop_lin['RESIDENTIAL UNITS']
del df_prop_lin['COMMERCIAL UNITS']
df_prop_lin.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 12 columns):
BOROUGH 57948 non-null category
BUILDING CLASS CATEGORY 57948 non-null object
TAX CLASS AT PRESENT 57948 non-null category
BLOCK 57948 non-null int64
LOT 57948 non-null int64
ZIP CODE 57948 non-null int64
TOTAL UNITS 57948 non-null int64
LAND SQUARE FEET 57948 non-null float64
GROSS SQUARE FEET 57948 non-null float64
YEAR BUILT 57948 non-null int64
TAX CLASS AT TIME OF SALE 57948 non-null category
SALE PRICE 57948 non-null int64
dtypes: category(3), float64(2), int64(6), object(1)
memory usage: 4.6+ MB
df_prop_lin.head()
| BOROUGH | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | ZIP CODE | TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | TAX CLASS AT TIME OF SALE | SALE PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26286 | 3 | 10 COOPS - ELEVATOR APARTMENTS | 2 | 6089 | 44 | 11209 | 0 | 2500.0 | 2000.0 | 1955 | 2 | 219000 |
| 55023 | 4 | 07 RENTALS - WALKUP APARTMENTS | 2 | 1487 | 61 | 11373 | 16 | 4000.0 | 9740.0 | 1927 | 2 | 3250000 |
| 56763 | 4 | 01 ONE FAMILY DWELLINGS | 1 | 7357 | 34 | 11365 | 1 | 2760.0 | 1188.0 | 1950 | 1 | 715000 |
| 47664 | 3 | 13 CONDOS - ELEVATOR APARTMENTS | 2 | 2245 | 1319 | 11206 | 1 | 2500.0 | 2000.0 | 2013 | 2 | 578920 |
| 47236 | 3 | 07 RENTALS - WALKUP APARTMENTS | 2A | 685 | 4 | 11232 | 6 | 2587.0 | 4000.0 | 1925 | 2 | 1250000 |
season = [winter, rainy, summer]
season_rainy, season_summer
File "<ipython-input-53-99e89a64e56d>", line 2
season_rainy, season_summer
^
IndentationError: unexpected indent
Perform one-hot encoding for categorical variables¶
#Select the variables to be one-hot encoded
one_hot_features = ['BOROUGH', 'BUILDING CLASS CATEGORY','TAX CLASS AT PRESENT','TAX CLASS AT TIME OF SALE']
# Convert categorical variables into dummy/indicator variables (i.e. one-hot encoding).
one_hot_encoded = pd.get_dummies(df_prop_lin[one_hot_features],drop_first=True)
one_hot_encoded
#one_hot_encoded.info(verbose=True, memory_usage=True, null_counts=True)
| BOROUGH_2 | BOROUGH_3 | BOROUGH_4 | BOROUGH_5 | BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS | BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS | BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS | BUILDING CLASS CATEGORY_05 TAX CLASS 1 VACANT LAND | BUILDING CLASS CATEGORY_06 TAX CLASS 1 - OTHER | BUILDING CLASS CATEGORY_07 RENTALS - WALKUP APARTMENTS | ... | TAX CLASS AT PRESENT_1C | TAX CLASS AT PRESENT_2 | TAX CLASS AT PRESENT_2A | TAX CLASS AT PRESENT_2B | TAX CLASS AT PRESENT_2C | TAX CLASS AT PRESENT_3 | TAX CLASS AT PRESENT_4 | TAX CLASS AT TIME OF SALE_2 | TAX CLASS AT TIME OF SALE_3 | TAX CLASS AT TIME OF SALE_4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26286 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 55023 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 56763 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 47664 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 47236 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 83254 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 82112 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10044 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 44548 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 48289 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 46216 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 234 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 24472 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 73449 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 76944 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 72359 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 40709 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 42960 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 80212 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 44195 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 30963 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6279 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 61077 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 44119 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 22973 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 55237 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7829 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 74118 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11537 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 55407 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75092 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 83271 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 80674 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14337 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 40328 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6497 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 33037 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 65621 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 65798 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 29235 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13274 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 44693 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 38986 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 17621 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 12390 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 24038 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 73328 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 724 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 36921 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19536 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 31975 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 80899 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 45516 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 15179 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 72025 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 26264 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 56193 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 65888 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11837 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 73846 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
57948 rows × 60 columns
df_prop_lin.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 12 columns):
BOROUGH 57948 non-null category
BUILDING CLASS CATEGORY 57948 non-null object
TAX CLASS AT PRESENT 57948 non-null category
BLOCK 57948 non-null int64
LOT 57948 non-null int64
ZIP CODE 57948 non-null int64
TOTAL UNITS 57948 non-null int64
LAND SQUARE FEET 57948 non-null float64
GROSS SQUARE FEET 57948 non-null float64
YEAR BUILT 57948 non-null int64
TAX CLASS AT TIME OF SALE 57948 non-null category
SALE PRICE 57948 non-null int64
dtypes: category(3), float64(2), int64(6), object(1)
memory usage: 4.6+ MB
one_hot_encoded.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 60 columns):
BOROUGH_2 57948 non-null uint8
BOROUGH_3 57948 non-null uint8
BOROUGH_4 57948 non-null uint8
BOROUGH_5 57948 non-null uint8
BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS 57948 non-null uint8
BUILDING CLASS CATEGORY_05 TAX CLASS 1 VACANT LAND 57948 non-null uint8
BUILDING CLASS CATEGORY_06 TAX CLASS 1 - OTHER 57948 non-null uint8
BUILDING CLASS CATEGORY_07 RENTALS - WALKUP APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_08 RENTALS - ELEVATOR APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_09 COOPS - WALKUP APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_10 COOPS - ELEVATOR APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_11 SPECIAL CONDO BILLING LOTS 57948 non-null uint8
BUILDING CLASS CATEGORY_11A CONDO-RENTALS 57948 non-null uint8
BUILDING CLASS CATEGORY_12 CONDOS - WALKUP APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_13 CONDOS - ELEVATOR APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_14 RENTALS - 4-10 UNIT 57948 non-null uint8
BUILDING CLASS CATEGORY_15 CONDOS - 2-10 UNIT RESIDENTIAL 57948 non-null uint8
BUILDING CLASS CATEGORY_16 CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT 57948 non-null uint8
BUILDING CLASS CATEGORY_17 CONDO COOPS 57948 non-null uint8
BUILDING CLASS CATEGORY_21 OFFICE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_22 STORE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_23 LOFT BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_26 OTHER HOTELS 57948 non-null uint8
BUILDING CLASS CATEGORY_27 FACTORIES 57948 non-null uint8
BUILDING CLASS CATEGORY_28 COMMERCIAL CONDOS 57948 non-null uint8
BUILDING CLASS CATEGORY_29 COMMERCIAL GARAGES 57948 non-null uint8
BUILDING CLASS CATEGORY_30 WAREHOUSES 57948 non-null uint8
BUILDING CLASS CATEGORY_31 COMMERCIAL VACANT LAND 57948 non-null uint8
BUILDING CLASS CATEGORY_32 HOSPITAL AND HEALTH FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_33 EDUCATIONAL FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_34 THEATRES 57948 non-null uint8
BUILDING CLASS CATEGORY_35 INDOOR PUBLIC AND CULTURAL FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_36 OUTDOOR RECREATIONAL FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_37 RELIGIOUS FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_38 ASYLUMS AND HOMES 57948 non-null uint8
BUILDING CLASS CATEGORY_39 TRANSPORTATION FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_41 TAX CLASS 4 - OTHER 57948 non-null uint8
BUILDING CLASS CATEGORY_42 CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC 57948 non-null uint8
BUILDING CLASS CATEGORY_43 CONDO OFFICE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_44 CONDO PARKING 57948 non-null uint8
BUILDING CLASS CATEGORY_45 CONDO HOTELS 57948 non-null uint8
BUILDING CLASS CATEGORY_46 CONDO STORE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_47 CONDO NON-BUSINESS STORAGE 57948 non-null uint8
BUILDING CLASS CATEGORY_48 CONDO TERRACES/GARDENS/CABANAS 57948 non-null uint8
BUILDING CLASS CATEGORY_49 CONDO WAREHOUSES/FACTORY/INDUS 57948 non-null uint8
TAX CLASS AT PRESENT_1 57948 non-null uint8
TAX CLASS AT PRESENT_1A 57948 non-null uint8
TAX CLASS AT PRESENT_1B 57948 non-null uint8
TAX CLASS AT PRESENT_1C 57948 non-null uint8
TAX CLASS AT PRESENT_2 57948 non-null uint8
TAX CLASS AT PRESENT_2A 57948 non-null uint8
TAX CLASS AT PRESENT_2B 57948 non-null uint8
TAX CLASS AT PRESENT_2C 57948 non-null uint8
TAX CLASS AT PRESENT_3 57948 non-null uint8
TAX CLASS AT PRESENT_4 57948 non-null uint8
TAX CLASS AT TIME OF SALE_2 57948 non-null uint8
TAX CLASS AT TIME OF SALE_3 57948 non-null uint8
TAX CLASS AT TIME OF SALE_4 57948 non-null uint8
dtypes: uint8(60)
memory usage: 3.8 MB
# Replacing categorical columns with dummies
fdf = df_prop_lin.drop(one_hot_features,axis=1)
fdf = pd.concat([fdf, one_hot_encoded] ,axis=1)
fdf.info()
#print (fdf.shape)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 68 columns):
BLOCK 57948 non-null int64
LOT 57948 non-null int64
ZIP CODE 57948 non-null int64
TOTAL UNITS 57948 non-null int64
LAND SQUARE FEET 57948 non-null float64
GROSS SQUARE FEET 57948 non-null float64
YEAR BUILT 57948 non-null int64
SALE PRICE 57948 non-null int64
BOROUGH_2 57948 non-null uint8
BOROUGH_3 57948 non-null uint8
BOROUGH_4 57948 non-null uint8
BOROUGH_5 57948 non-null uint8
BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS 57948 non-null uint8
BUILDING CLASS CATEGORY_05 TAX CLASS 1 VACANT LAND 57948 non-null uint8
BUILDING CLASS CATEGORY_06 TAX CLASS 1 - OTHER 57948 non-null uint8
BUILDING CLASS CATEGORY_07 RENTALS - WALKUP APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_08 RENTALS - ELEVATOR APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_09 COOPS - WALKUP APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_10 COOPS - ELEVATOR APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_11 SPECIAL CONDO BILLING LOTS 57948 non-null uint8
BUILDING CLASS CATEGORY_11A CONDO-RENTALS 57948 non-null uint8
BUILDING CLASS CATEGORY_12 CONDOS - WALKUP APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_13 CONDOS - ELEVATOR APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_14 RENTALS - 4-10 UNIT 57948 non-null uint8
BUILDING CLASS CATEGORY_15 CONDOS - 2-10 UNIT RESIDENTIAL 57948 non-null uint8
BUILDING CLASS CATEGORY_16 CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT 57948 non-null uint8
BUILDING CLASS CATEGORY_17 CONDO COOPS 57948 non-null uint8
BUILDING CLASS CATEGORY_21 OFFICE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_22 STORE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_23 LOFT BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_26 OTHER HOTELS 57948 non-null uint8
BUILDING CLASS CATEGORY_27 FACTORIES 57948 non-null uint8
BUILDING CLASS CATEGORY_28 COMMERCIAL CONDOS 57948 non-null uint8
BUILDING CLASS CATEGORY_29 COMMERCIAL GARAGES 57948 non-null uint8
BUILDING CLASS CATEGORY_30 WAREHOUSES 57948 non-null uint8
BUILDING CLASS CATEGORY_31 COMMERCIAL VACANT LAND 57948 non-null uint8
BUILDING CLASS CATEGORY_32 HOSPITAL AND HEALTH FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_33 EDUCATIONAL FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_34 THEATRES 57948 non-null uint8
BUILDING CLASS CATEGORY_35 INDOOR PUBLIC AND CULTURAL FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_36 OUTDOOR RECREATIONAL FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_37 RELIGIOUS FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_38 ASYLUMS AND HOMES 57948 non-null uint8
BUILDING CLASS CATEGORY_39 TRANSPORTATION FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_41 TAX CLASS 4 - OTHER 57948 non-null uint8
BUILDING CLASS CATEGORY_42 CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC 57948 non-null uint8
BUILDING CLASS CATEGORY_43 CONDO OFFICE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_44 CONDO PARKING 57948 non-null uint8
BUILDING CLASS CATEGORY_45 CONDO HOTELS 57948 non-null uint8
BUILDING CLASS CATEGORY_46 CONDO STORE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_47 CONDO NON-BUSINESS STORAGE 57948 non-null uint8
BUILDING CLASS CATEGORY_48 CONDO TERRACES/GARDENS/CABANAS 57948 non-null uint8
BUILDING CLASS CATEGORY_49 CONDO WAREHOUSES/FACTORY/INDUS 57948 non-null uint8
TAX CLASS AT PRESENT_1 57948 non-null uint8
TAX CLASS AT PRESENT_1A 57948 non-null uint8
TAX CLASS AT PRESENT_1B 57948 non-null uint8
TAX CLASS AT PRESENT_1C 57948 non-null uint8
TAX CLASS AT PRESENT_2 57948 non-null uint8
TAX CLASS AT PRESENT_2A 57948 non-null uint8
TAX CLASS AT PRESENT_2B 57948 non-null uint8
TAX CLASS AT PRESENT_2C 57948 non-null uint8
TAX CLASS AT PRESENT_3 57948 non-null uint8
TAX CLASS AT PRESENT_4 57948 non-null uint8
TAX CLASS AT TIME OF SALE_2 57948 non-null uint8
TAX CLASS AT TIME OF SALE_3 57948 non-null uint8
TAX CLASS AT TIME OF SALE_4 57948 non-null uint8
dtypes: float64(2), int64(6), uint8(60)
memory usage: 7.3 MB
del fdf['LOT']
del fdf['BLOCK']
del fdf['ZIP CODE']
del fdf['YEAR BUILT']
fdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57948 entries, 26286 to 73846
Data columns (total 64 columns):
TOTAL UNITS 57948 non-null int64
LAND SQUARE FEET 57948 non-null float64
GROSS SQUARE FEET 57948 non-null float64
SALE PRICE 57948 non-null int64
BOROUGH_2 57948 non-null uint8
BOROUGH_3 57948 non-null uint8
BOROUGH_4 57948 non-null uint8
BOROUGH_5 57948 non-null uint8
BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS 57948 non-null uint8
BUILDING CLASS CATEGORY_05 TAX CLASS 1 VACANT LAND 57948 non-null uint8
BUILDING CLASS CATEGORY_06 TAX CLASS 1 - OTHER 57948 non-null uint8
BUILDING CLASS CATEGORY_07 RENTALS - WALKUP APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_08 RENTALS - ELEVATOR APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_09 COOPS - WALKUP APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_10 COOPS - ELEVATOR APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_11 SPECIAL CONDO BILLING LOTS 57948 non-null uint8
BUILDING CLASS CATEGORY_11A CONDO-RENTALS 57948 non-null uint8
BUILDING CLASS CATEGORY_12 CONDOS - WALKUP APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_13 CONDOS - ELEVATOR APARTMENTS 57948 non-null uint8
BUILDING CLASS CATEGORY_14 RENTALS - 4-10 UNIT 57948 non-null uint8
BUILDING CLASS CATEGORY_15 CONDOS - 2-10 UNIT RESIDENTIAL 57948 non-null uint8
BUILDING CLASS CATEGORY_16 CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT 57948 non-null uint8
BUILDING CLASS CATEGORY_17 CONDO COOPS 57948 non-null uint8
BUILDING CLASS CATEGORY_21 OFFICE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_22 STORE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_23 LOFT BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_26 OTHER HOTELS 57948 non-null uint8
BUILDING CLASS CATEGORY_27 FACTORIES 57948 non-null uint8
BUILDING CLASS CATEGORY_28 COMMERCIAL CONDOS 57948 non-null uint8
BUILDING CLASS CATEGORY_29 COMMERCIAL GARAGES 57948 non-null uint8
BUILDING CLASS CATEGORY_30 WAREHOUSES 57948 non-null uint8
BUILDING CLASS CATEGORY_31 COMMERCIAL VACANT LAND 57948 non-null uint8
BUILDING CLASS CATEGORY_32 HOSPITAL AND HEALTH FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_33 EDUCATIONAL FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_34 THEATRES 57948 non-null uint8
BUILDING CLASS CATEGORY_35 INDOOR PUBLIC AND CULTURAL FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_36 OUTDOOR RECREATIONAL FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_37 RELIGIOUS FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_38 ASYLUMS AND HOMES 57948 non-null uint8
BUILDING CLASS CATEGORY_39 TRANSPORTATION FACILITIES 57948 non-null uint8
BUILDING CLASS CATEGORY_41 TAX CLASS 4 - OTHER 57948 non-null uint8
BUILDING CLASS CATEGORY_42 CONDO CULTURAL/MEDICAL/EDUCATIONAL/ETC 57948 non-null uint8
BUILDING CLASS CATEGORY_43 CONDO OFFICE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_44 CONDO PARKING 57948 non-null uint8
BUILDING CLASS CATEGORY_45 CONDO HOTELS 57948 non-null uint8
BUILDING CLASS CATEGORY_46 CONDO STORE BUILDINGS 57948 non-null uint8
BUILDING CLASS CATEGORY_47 CONDO NON-BUSINESS STORAGE 57948 non-null uint8
BUILDING CLASS CATEGORY_48 CONDO TERRACES/GARDENS/CABANAS 57948 non-null uint8
BUILDING CLASS CATEGORY_49 CONDO WAREHOUSES/FACTORY/INDUS 57948 non-null uint8
TAX CLASS AT PRESENT_1 57948 non-null uint8
TAX CLASS AT PRESENT_1A 57948 non-null uint8
TAX CLASS AT PRESENT_1B 57948 non-null uint8
TAX CLASS AT PRESENT_1C 57948 non-null uint8
TAX CLASS AT PRESENT_2 57948 non-null uint8
TAX CLASS AT PRESENT_2A 57948 non-null uint8
TAX CLASS AT PRESENT_2B 57948 non-null uint8
TAX CLASS AT PRESENT_2C 57948 non-null uint8
TAX CLASS AT PRESENT_3 57948 non-null uint8
TAX CLASS AT PRESENT_4 57948 non-null uint8
TAX CLASS AT TIME OF SALE_2 57948 non-null uint8
TAX CLASS AT TIME OF SALE_3 57948 non-null uint8
TAX CLASS AT TIME OF SALE_4 57948 non-null uint8
dtypes: float64(2), int64(2), uint8(60)
memory usage: 5.5 MB
Training and Test data¶
X = fdf.drop(['SALE PRICE'],axis=1)
print (X)
y = fdf['SALE PRICE']
fdf_normalized = pd.DataFrame(data=X, index=X.index, columns=X.columns)
print (y)
TOTAL UNITS LAND SQUARE FEET GROSS SQUARE FEET BOROUGH_2 BOROUGH_3 \
26286 0 2500.0 2000.0 0 1
55023 16 4000.0 9740.0 0 0
56763 1 2760.0 1188.0 0 0
47664 1 2500.0 2000.0 0 1
47236 6 2587.0 4000.0 0 1
83254 1 9800.0 1428.0 0 0
82112 1 7750.0 1298.0 0 0
10044 0 2500.0 2000.0 0 0
44548 1 2500.0 2000.0 0 1
48289 3 1910.0 2760.0 0 1
46216 1 2500.0 2000.0 0 1
234 7 2469.0 19892.0 0 0
24472 1 2375.0 1425.0 1 0
73449 1 2905.0 1152.0 0 0
76944 2 2880.0 1344.0 0 0
72359 1 2190.0 784.0 0 0
40709 3 2500.0 3150.0 0 1
42960 2 1353.0 1428.0 0 1
80212 2 2712.0 1800.0 0 0
44195 1 2000.0 1372.0 0 1
30963 1 2500.0 2000.0 0 1
6279 1 2500.0 2000.0 0 0
61077 0 2500.0 2000.0 0 0
44119 1 2500.0 2000.0 0 1
22973 0 2500.0 2000.0 1 0
55237 1 2500.0 2000.0 0 0
7829 0 2500.0 2000.0 0 0
74118 1 3000.0 1744.0 0 0
11537 1 2500.0 2000.0 0 0
55407 1 2500.0 2000.0 0 0
... ... ... ... ... ...
75092 2 2500.0 1552.0 0 0
83271 1 899.0 1488.0 0 0
80674 1 4000.0 2124.0 0 0
14337 0 2500.0 2000.0 0 0
40328 1 2500.0 2000.0 0 1
6497 1 2500.0 2000.0 0 0
33037 0 2500.0 2000.0 0 1
65621 1 3760.0 1878.0 0 0
65798 2 2500.0 4000.0 0 0
29235 2 2003.0 1920.0 0 1
13274 0 2500.0 2000.0 0 0
44693 0 2500.0 2000.0 0 1
38986 4 2700.0 3600.0 0 1
17621 0 2500.0 2000.0 0 0
12390 0 2500.0 2000.0 0 0
24038 0 2500.0 2000.0 1 0
73328 2 5000.0 1800.0 0 0
724 0 2500.0 2000.0 0 0
36921 0 1500.0 2000.0 0 1
19536 2 2146.0 2760.0 1 0
31975 0 2500.0 2000.0 0 1
80899 2 2604.0 1400.0 0 0
45516 1 2500.0 2000.0 0 1
15179 0 2500.0 2000.0 0 0
72025 1 2000.0 1624.0 0 0
26264 0 2500.0 2000.0 0 1
56193 1 3800.0 1624.0 0 0
65888 1 5000.0 1073.0 0 0
11837 0 2500.0 2000.0 0 0
73846 2 2926.0 2082.0 0 0
BOROUGH_4 BOROUGH_5 \
26286 0 0
55023 1 0
56763 1 0
47664 0 0
47236 0 0
83254 0 1
82112 0 1
10044 0 0
44548 0 0
48289 0 0
46216 0 0
234 0 0
24472 0 0
73449 1 0
76944 0 1
72359 1 0
40709 0 0
42960 0 0
80212 0 1
44195 0 0
30963 0 0
6279 0 0
61077 1 0
44119 0 0
22973 0 0
55237 1 0
7829 0 0
74118 1 0
11537 0 0
55407 1 0
... ... ...
75092 1 0
83271 0 1
80674 0 1
14337 0 0
40328 0 0
6497 0 0
33037 0 0
65621 1 0
65798 1 0
29235 0 0
13274 0 0
44693 0 0
38986 0 0
17621 0 0
12390 0 0
24038 0 0
73328 1 0
724 0 0
36921 0 0
19536 0 0
31975 0 0
80899 0 1
45516 0 0
15179 0 0
72025 1 0
26264 0 0
56193 1 0
65888 1 0
11837 0 0
73846 1 0
BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS \
26286 0
55023 0
56763 0
47664 0
47236 0
83254 0
82112 0
10044 0
44548 0
48289 0
46216 0
234 0
24472 0
73449 0
76944 1
72359 0
40709 0
42960 1
80212 1
44195 0
30963 0
6279 0
61077 0
44119 0
22973 0
55237 0
7829 0
74118 0
11537 0
55407 0
... ...
75092 1
83271 0
80674 0
14337 0
40328 0
6497 0
33037 0
65621 0
65798 1
29235 1
13274 0
44693 0
38986 0
17621 0
12390 0
24038 0
73328 1
724 0
36921 0
19536 1
31975 0
80899 1
45516 0
15179 0
72025 0
26264 0
56193 0
65888 0
11837 0
73846 1
BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS \
26286 0
55023 0
56763 0
47664 0
47236 0
83254 0
82112 0
10044 0
44548 0
48289 1
46216 0
234 0
24472 0
73449 0
76944 0
72359 0
40709 1
42960 0
80212 0
44195 0
30963 0
6279 0
61077 0
44119 0
22973 0
55237 0
7829 0
74118 0
11537 0
55407 0
... ...
75092 0
83271 0
80674 0
14337 0
40328 0
6497 0
33037 0
65621 0
65798 0
29235 0
13274 0
44693 0
38986 0
17621 0
12390 0
24038 0
73328 0
724 0
36921 0
19536 0
31975 0
80899 0
45516 0
15179 0
72025 0
26264 0
56193 0
65888 0
11837 0
73846 0
BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS \
26286 0
55023 0
56763 0
47664 0
47236 0
83254 0
82112 0
10044 0
44548 1
48289 0
46216 1
234 0
24472 0
73449 0
76944 0
72359 0
40709 0
42960 0
80212 0
44195 0
30963 0
6279 0
61077 0
44119 0
22973 0
55237 0
7829 0
74118 0
11537 0
55407 0
... ...
75092 0
83271 0
80674 0
14337 0
40328 1
6497 0
33037 0
65621 0
65798 0
29235 0
13274 0
44693 0
38986 0
17621 0
12390 0
24038 0
73328 0
724 0
36921 0
19536 0
31975 0
80899 0
45516 0
15179 0
72025 0
26264 0
56193 0
65888 0
11837 0
73846 0
... TAX CLASS AT PRESENT_1C TAX CLASS AT PRESENT_2 \
26286 ... 0 1
55023 ... 0 1
56763 ... 0 0
47664 ... 0 1
47236 ... 0 0
83254 ... 0 0
82112 ... 0 0
10044 ... 0 1
44548 ... 1 0
48289 ... 0 0
46216 ... 1 0
234 ... 0 0
24472 ... 0 0
73449 ... 0 0
76944 ... 0 0
72359 ... 0 0
40709 ... 0 0
42960 ... 0 0
80212 ... 0 0
44195 ... 0 0
30963 ... 0 1
6279 ... 0 1
61077 ... 0 1
44119 ... 0 0
22973 ... 0 1
55237 ... 0 1
7829 ... 0 1
74118 ... 0 0
11537 ... 0 1
55407 ... 0 0
... ... ... ...
75092 ... 0 0
83271 ... 0 0
80674 ... 0 0
14337 ... 0 1
40328 ... 0 0
6497 ... 0 1
33037 ... 0 0
65621 ... 0 0
65798 ... 0 0
29235 ... 0 0
13274 ... 0 1
44693 ... 0 0
38986 ... 0 0
17621 ... 0 1
12390 ... 0 1
24038 ... 0 1
73328 ... 0 0
724 ... 0 0
36921 ... 0 0
19536 ... 0 0
31975 ... 0 1
80899 ... 0 0
45516 ... 0 1
15179 ... 0 1
72025 ... 0 0
26264 ... 0 1
56193 ... 0 0
65888 ... 0 0
11837 ... 0 0
73846 ... 0 0
TAX CLASS AT PRESENT_2A TAX CLASS AT PRESENT_2B \
26286 0 0
55023 0 0
56763 0 0
47664 0 0
47236 1 0
83254 0 0
82112 0 0
10044 0 0
44548 0 0
48289 0 0
46216 0 0
234 0 1
24472 0 0
73449 0 0
76944 0 0
72359 0 0
40709 0 0
42960 0 0
80212 0 0
44195 0 0
30963 0 0
6279 0 0
61077 0 0
44119 0 0
22973 0 0
55237 0 0
7829 0 0
74118 0 0
11537 0 0
55407 0 0
... ... ...
75092 0 0
83271 0 0
80674 0 0
14337 0 0
40328 0 0
6497 0 0
33037 0 0
65621 0 0
65798 0 0
29235 0 0
13274 0 0
44693 0 0
38986 1 0
17621 0 0
12390 0 0
24038 0 0
73328 0 0
724 0 0
36921 0 0
19536 0 0
31975 0 0
80899 0 0
45516 0 0
15179 0 0
72025 0 0
26264 0 0
56193 0 0
65888 0 0
11837 0 0
73846 0 0
TAX CLASS AT PRESENT_2C TAX CLASS AT PRESENT_3 \
26286 0 0
55023 0 0
56763 0 0
47664 0 0
47236 0 0
83254 0 0
82112 0 0
10044 0 0
44548 0 0
48289 0 0
46216 0 0
234 0 0
24472 0 0
73449 0 0
76944 0 0
72359 0 0
40709 0 0
42960 0 0
80212 0 0
44195 0 0
30963 0 0
6279 0 0
61077 0 0
44119 0 0
22973 0 0
55237 0 0
7829 0 0
74118 0 0
11537 0 0
55407 0 0
... ... ...
75092 0 0
83271 0 0
80674 0 0
14337 0 0
40328 0 0
6497 0 0
33037 1 0
65621 0 0
65798 0 0
29235 0 0
13274 0 0
44693 1 0
38986 0 0
17621 0 0
12390 0 0
24038 0 0
73328 0 0
724 0 0
36921 0 0
19536 0 0
31975 0 0
80899 0 0
45516 0 0
15179 0 0
72025 0 0
26264 0 0
56193 0 0
65888 0 0
11837 0 0
73846 0 0
TAX CLASS AT PRESENT_4 TAX CLASS AT TIME OF SALE_2 \
26286 0 1
55023 0 1
56763 0 0
47664 0 1
47236 0 1
83254 0 0
82112 0 0
10044 0 1
44548 0 0
48289 0 0
46216 0 0
234 0 1
24472 0 0
73449 0 0
76944 0 0
72359 0 0
40709 0 0
42960 0 0
80212 0 0
44195 0 0
30963 0 1
6279 0 1
61077 0 1
44119 1 0
22973 0 1
55237 0 1
7829 0 1
74118 0 0
11537 0 1
55407 1 0
... ... ...
75092 0 0
83271 0 0
80674 0 0
14337 0 1
40328 0 0
6497 0 1
33037 0 1
65621 0 0
65798 0 0
29235 0 0
13274 0 1
44693 0 1
38986 0 1
17621 0 1
12390 0 1
24038 0 1
73328 0 0
724 0 1
36921 0 0
19536 0 0
31975 0 1
80899 0 0
45516 0 1
15179 0 1
72025 0 0
26264 0 1
56193 0 0
65888 0 0
11837 0 1
73846 0 0
TAX CLASS AT TIME OF SALE_3 TAX CLASS AT TIME OF SALE_4
26286 0 0
55023 0 0
56763 0 0
47664 0 0
47236 0 0
83254 0 0
82112 0 0
10044 0 0
44548 0 0
48289 0 0
46216 0 0
234 0 0
24472 0 0
73449 0 0
76944 0 0
72359 0 0
40709 0 0
42960 0 0
80212 0 0
44195 0 0
30963 0 0
6279 0 0
61077 0 0
44119 0 1
22973 0 0
55237 0 0
7829 0 0
74118 0 0
11537 0 0
55407 0 1
... ... ...
75092 0 0
83271 0 0
80674 0 0
14337 0 0
40328 0 0
6497 0 0
33037 0 0
65621 0 0
65798 0 0
29235 0 0
13274 0 0
44693 0 0
38986 0 0
17621 0 0
12390 0 0
24038 0 0
73328 0 0
724 0 0
36921 0 0
19536 0 0
31975 0 0
80899 0 0
45516 0 0
15179 0 0
72025 0 0
26264 0 0
56193 0 0
65888 0 0
11837 0 0
73846 0 0
[57948 rows x 63 columns]
26286 219000
55023 3250000
56763 715000
47664 578920
47236 1250000
83254 610000
82112 650000
10044 670000
44548 1325000
48289 1950000
46216 478000
234 4600000
24472 410000
73449 360500
76944 389900
72359 225000
40709 2400000
42960 300000
80212 287279
44195 445000
30963 3598989
6279 385000
61077 338412
44119 5000
22973 520000
55237 545000
7829 527000
74118 569900
11537 8500000
55407 150000
...
75092 825000
83271 340000
80674 695000
14337 615000
40328 670000
6497 800000
33037 356250
65621 681159
65798 225000
29235 900000
13274 706000
44693 948800
38986 660000
17621 875000
12390 850000
24038 190000
73328 790000
724 1600000
36921 100000
19536 550000
31975 255000
80899 515000
45516 885877
15179 162500
72025 405000
26264 305000
56193 900000
65888 685000
11837 3015000
73846 603750
Name: SALE PRICE, Length: 57948, dtype: int64
fdf_normalized.head(6)
| TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | BOROUGH_2 | BOROUGH_3 | BOROUGH_4 | BOROUGH_5 | BUILDING CLASS CATEGORY_02 TWO FAMILY DWELLINGS | BUILDING CLASS CATEGORY_03 THREE FAMILY DWELLINGS | BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS | ... | TAX CLASS AT PRESENT_1C | TAX CLASS AT PRESENT_2 | TAX CLASS AT PRESENT_2A | TAX CLASS AT PRESENT_2B | TAX CLASS AT PRESENT_2C | TAX CLASS AT PRESENT_3 | TAX CLASS AT PRESENT_4 | TAX CLASS AT TIME OF SALE_2 | TAX CLASS AT TIME OF SALE_3 | TAX CLASS AT TIME OF SALE_4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26286 | 0 | 2500.0 | 2000.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 55023 | 16 | 4000.0 | 9740.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 56763 | 1 | 2760.0 | 1188.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 47664 | 1 | 2500.0 | 2000.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 47236 | 6 | 2587.0 | 4000.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 83254 | 1 | 9800.0 | 1428.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 rows × 63 columns
Split the data into train and test¶
X_train, X_test, y_train, y_test = train_test_split(fdf_normalized,y)
X_test.shape
(14487, 63)
X_train.shape
(43461, 63)
Train the model¶
# initialize the model
lr= LinearRegression()
# fit the model
model_fit=lr.fit(X_train,y_train)
print (model_fit.coef_)
print (model_fit.intercept_)
y= 3*Area + 7*LandSqfeet - 9xTax_Class+16
[-1.93526117e+03 1.09938376e+00 1.51695891e+00 -1.58952134e+06
-1.11906190e+06 -1.35676903e+06 -1.57092132e+06 8.01383873e+04
2.33315865e+05 -2.16414274e+06 -6.02943932e+04 -1.67683715e+05
3.04148063e+06 5.68560182e+06 -2.04595243e+06 -1.93379681e+06
-1.71505712e+06 1.65251503e+05 -1.63014307e+06 -1.06722527e+06
3.45510299e+06 -1.33788139e+06 -7.25208862e+05 -2.18460893e+06
7.02864152e+05 8.21351186e+05 3.53918153e+06 -1.88313563e+06
1.63066930e+06 -1.10633429e+06 1.10187196e+05 8.64417632e+05
-2.03417228e+05 3.02767117e+06 2.09679511e+06 1.81778072e+06
8.96376694e+04 -7.71142596e+05 6.82083625e+05 7.63769534e+05
-1.79297008e+06 -1.70220859e+05 -8.93666560e+05 -2.81805615e+05
-1.83891930e+06 -2.62056702e+06 1.21997273e+06 -2.61822687e+06
-1.72754459e+06 -1.99179836e+06 -2.24451025e+06 -2.65777004e+05
-2.22173899e+06 2.94333424e+05 -5.99135136e+05 -4.41890764e+06
-3.29384363e+06 -2.97796449e+05 -4.65661287e-10 -6.18578672e+05
-2.92436926e+05 0.00000000e+00 -5.33367450e+05]
4274465.50771582
Test the model¶
#predict on test data
test_pred = model_fit.predict(X_test)
#Answers provided by us in the exam
#y_test => Answer Key
test_pred
array([2331681.15819392, 1763246.76111356, 1182925.97682065, ...,
556292.18400827, 830775.68400524, 1187032.87294663])
test_null = y_test.mean()
#mean squared error
mse=mean_squared_error(y_test,test_null)
#root mean squared error
print('test rmse: {}'.format(np.sqrt(mse)))
#mean absolute error
#mae=mean_absolute_error(y_train,train_pred)
#print('train mae: {}'.format(mae))
test rmse: 1217441.6866231528
test_null = np.zeros_like(y_test, dtype=float)
test_null.fill(y_test.mean())
mse=mean_squared_error(y_test,test_null)
print('Null rmse: {}'.format(np.sqrt(mse)))
Null rmse: 1483445.904070305
len(test_pred)
14487
Model Explainability¶
LIME¶
X_test.values[4].shape
(63,)
# Import lime package
import lime
import lime.lime_tabular
#Find caegorical features
categorical_features = np.argwhere(np.array([len(set(X_test.values[:,x])) for x in range(X_test.values.shape[1])]) <= 10).flatten()
#Lime explainer for regression
explainer = lime.lime_tabular.LimeTabularExplainer(X_test.values,
feature_names=X_test.columns.values.tolist(),
class_names=['PriceCrore'],
categorical_features=categorical_features,
verbose=True, mode='regression')
ind = 4
#Fit on test data
exp = explainer.explain_instance(X_test.values[ind], model_fit.predict, num_features=6)
#Show in notebook features influencing predictions
exp.show_in_notebook(show_table=True)
Intercept -3825067.6591951344
Prediction_local [2025956.16261294]
Right: 1200453.3973587556
ELI5¶
# Import Eli5 package
import eli5
from eli5.sklearn import PermutationImportance
# Find the importance of columns for prediction
perm = PermutationImportance(model_fit, random_state=1).fit(X_test,test_pred)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())
| Weight | Feature |
|---|---|
| 3.4368 ± 0.0219 | TAX CLASS AT PRESENT_1 |
| 1.6247 ± 0.0207 | BUILDING CLASS CATEGORY_10 COOPS - ELEVATOR APARTMENTS |
| 1.1412 ± 0.0290 | TAX CLASS AT PRESENT_2A |
| 1.0982 ± 0.0217 | BOROUGH_4 |
| 0.6989 ± 0.0157 | BUILDING CLASS CATEGORY_07 RENTALS - WALKUP APARTMENTS |
| 0.6719 ± 0.0082 | BOROUGH_3 |
| 0.6370 ± 0.0085 | BOROUGH_5 |
| 0.5511 ± 0.0085 | BOROUGH_2 |
| 0.5004 ± 0.0022 | BUILDING CLASS CATEGORY_09 COOPS - WALKUP APARTMENTS |
| 0.4493 ± 0.0051 | BUILDING CLASS CATEGORY_13 CONDOS - ELEVATOR APARTMENTS |
| 0.2710 ± 0.0035 | BUILDING CLASS CATEGORY_04 TAX CLASS 1 CONDOS |
| 0.2566 ± 0.0034 | BUILDING CLASS CATEGORY_17 CONDO COOPS |
| 0.2437 ± 0.0063 | TAX CLASS AT PRESENT_2 |
| 0.2019 ± 0.0053 | BUILDING CLASS CATEGORY_14 RENTALS - 4-10 UNIT |
| 0.1964 ± 0.0050 | TAX CLASS AT PRESENT_2B |
| 0.1054 ± 0.0000 | BUILDING CLASS CATEGORY_08 RENTALS - ELEVATOR APARTMENTS |
| 0.1045 ± 0.0015 | TAX CLASS AT PRESENT_1B |
| 0.0915 ± 0.0011 | BUILDING CLASS CATEGORY_15 CONDOS - 2-10 UNIT RESIDENTIAL |
| 0.0827 ± 0.0019 | BUILDING CLASS CATEGORY_12 CONDOS - WALKUP APARTMENTS |
| 0.0595 ± 0.0007 | TAX CLASS AT TIME OF SALE_2 |
| … 43 more … | |
#Understanding how each feature influences the prediction
eli5.show_prediction(model_fit, doc=X_test.iloc[[ind]], feature_names=list(X_test.columns))
Wage = 1*Education + 300000
Wage = 300002
Wage = 300003
Wage = 300040
y (score 1200453.397) top features
| Contribution? | Feature |
|---|---|
| +4274465.508 | <BIAS> |
| +3033.918 | GROSS SQUARE FEET |
| +2748.459 | LAND SQUARE FEET |
| -1935.261 | TOTAL UNITS |
| -292436.926 | TAX CLASS AT TIME OF SALE_2 |
| -599135.136 | TAX CLASS AT PRESENT_2 |
| -1067225.268 | BUILDING CLASS CATEGORY_13 CONDOS - ELEVATOR APARTMENTS |
| -1119061.895 | BOROUGH_3 |
SHAP¶
#Import SHAP package
import shap
#Create explainer for linear model
explainer = shap.LinearExplainer(model_fit,data=X_test.values)
shap_values = explainer.shap_values(X_test)
#Understanding how each feature influences the prediction
shap.initjs()
ind = 11
shap.force_plot(
explainer.expected_value, shap_values[ind,:], X_test.iloc[ind,:],
feature_names=X_test.columns.tolist()
)
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
shap.summary_plot(shap_values,X_test)